- The String-extensions Library
-
- Copyright (c) 1994 Carnegie Mellon University
-
-
- Introduction
-
- String-extensions is a library of routines for working with characters
- and strings. String-extensions exports these modules:
- Conversions
- This module consists of various useful conversions involving
- strings.
- Character-type
- This module is a Dylanized version of the C library ctype.h.
- String-hacking
- This module exports miscellaneous functions and data structures
- that are useful when working with strings and characters.
- Regular-expressions
- This module contains various functions that deal with regular
- expressions (regexps).
- Substring-search
- This module contains methods for searching for fixed substrings
- rather than general regular expressions.
-
-
- The Conversions Module
-
- The Conversions module consists of various useful conversions
- involving strings. They are:
-
- string-to-integer(string, #key base) => integer [Function]
- integer-to-string(integer, #key base) => string [Function]
- digit-to-integer(character) => integer [Function]
- integer-to-digit(integer) => character [Function]
- Base defaults to 10, and is the radix for the number system to
- convert from/to. Bases below 2 are errors, as are bases above
- 36. When converting from a string, the string must exactly describe a
- number, with no excess characters. Digit-to-integer will signal an
- error if the digit is non-alphanumeric. Errors will be signalled
- for all invalid input.
-
- as(<string>, character) [G.F. Method]
- Turns a character into the appropriate string of length one.
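-
- As a rough illustration of the conversions above, the following sketch
- round-trips an integer through a base-16 string. It assumes the
- enclosing module uses the Conversions module; round-trip-hex is a
- hypothetical name, not part of the library.

```dylan
// Sketch only: assumes the Conversions module is in scope.
// round-trip-hex is a hypothetical helper, not a library function.
define method round-trip-hex (n :: <integer>) => (same? :: <boolean>)
  // Write n in radix 16, then parse those digits back with the same base.
  let hex = integer-to-string(n, base: 16);
  let back = string-to-integer(hex, base: 16);
  back = n
end method round-trip-hex;
```

- The result is expected to be #t as long as string-to-integer accepts
- whatever digit case integer-to-string produces, which the text above
- does not spell out.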
-
-
-
- The Character-type Module
-
- Character-type is a Dylanized version of the C library ctype.h. It
- contains the following functions:
- FUNCTION AND ARG TYPE        RETURNS #t FOR THESE CHARACTERS
- alpha?(character)            a-zA-Z
- digit?(character)            0-9
- alphanumeric?(character)     a-zA-Z0-9
- whitespace?(character)       Space, tab, newline, formfeed,
-                              carriage return
- uppercase?(character)        A-Z
- lowercase?(character)        a-z
- hex-digit?(character)        0-9a-f
- punctuation?(character)      ,./<>?;\:"|'[]{}!@#$%^&*()-=_+`~
- graphic?(character)          alphanumeric or punctuation
- printable?(character)        graphic or whitespace
- control?(character)          not printable
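-
- The predicates in the table can be used like any other Dylan
- function. The sketch below counts the digit characters in a string;
- it assumes the enclosing module uses the Character-type module, and
- count-digits is a hypothetical name.

```dylan
// Sketch only: assumes the Character-type module is in scope.
// count-digits is a hypothetical helper, not a library function.
define method count-digits (text :: <string>) => (total :: <integer>)
  let total = 0;
  for (ch in text)
    // digit? is true for the characters 0-9, per the table above.
    if (digit?(ch))
      total := total + 1;
    end if;
  end for;
  total
end method count-digits;
```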
-
-
-
- String-hacking
-
- The String-hacking module exports miscellaneous functions and data
- structures that are useful when working with strings and characters.
-
- add-last(stretchy-sequence, object) [Generic Function]
- => stretchy-sequence
- add-last(string, character) => string [G.F. Method]
- Like add, except that it is guaranteed to add the character to the
- end of the string.
-
- predecessor(character) => character [Function]
- Get the character before this character. Equivalent to
- as(<character>, -1 + as(<integer>, character))
-
- successor(character) => character [Function]
- Get the character after this character. Equivalent to
- as(<character>, 1 + as(<integer>, character))
-
- case-insensitive-equal(object1, object2) [Generic Function]
- case-insensitive-equal(string1, string2) [G.F. Method]
- case-insensitive-equal(character1, character2) [G.F. Method]
- Does a case insensitive equality test. Methods are provided only
- for strings and characters, not general collections.
-
- <character-set> [Sealed Abstract Class]
- <case-sensitive-character-set> [Class]
- <case-insensitive-character-set> [Class]
- A <character-set> is a non-mutable subclass of <collection>, and is
- conceptually an unordered set of characters. Dylan collection
- elements always have keys, so to fit sets into Dylan, the key of an
- element of a character set is the element itself. There are two
- instantiable subclasses of <character-set>,
- <case-sensitive-character-set> and
- <case-insensitive-character-set>. <character-set> is not
- instantiable; one must always specify one of the instantiable
- subclasses when creating a character set.
-
- There are two ways of making a character set. The first is a
- method for make using the description: keyword. The value that
- follows the description: keyword is a string that describes the set
- using a notation like a regular expression character set, except
- without the '[' and ']' delimiters. For example,
- make(<case-sensitive-character-set>, description: "a-z")
- would be the set of all lowercase alphabetic characters.
-
- A second way to create character sets is to use an "as" method.
- The as method basically takes a collection of characters and
- discards the keys of these characters. Example:
- as(<case-insensitive-character-set>,
- "abcdefghijklmnopqrstuvwxyz")
- is again the set of all lowercase alphabetic characters. It is
- important to realize that the as method does *not* take a
- description:
- as(<case-sensitive-character-set>, "a-z")
- returns the set of 'a', '-', and 'z', not the set of all alphabetic
- characters.
-
- The most useful operation on character sets is member?, which does
- what one would expect. Another useful operation is the
- forward-iteration-protocol. This basically calls member? on every
- possible character until it finds a character that is a member of
- the set. This means that in a <case-insensitive-character-set>,
- both 'a' and 'A' will come up.
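-
- A small sketch of building a character set from a description and
- testing membership. It assumes the enclosing module uses the
- String-hacking module; $vowels and vowel? are hypothetical names.

```dylan
// Sketch only: assumes the String-hacking module is in scope.
// $vowels and vowel? are hypothetical names, not library exports.
define constant $vowels
  = make(<case-insensitive-character-set>, description: "aeiou");

define method vowel? (ch :: <character>) => (answer :: <boolean>)
  // The set is case insensitive, so 'A' and 'a' are both members.
  member?(ch, $vowels)
end method vowel?;
```

- Note that the description: form is used here; passing "aeiou" to the
- as method would give the same set only because the string contains no
- range notation.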
-
- <byte-character-table> [Class]
- A byte-character-table is a vector that uses byte characters as
- indices instead of integers. The following are equivalent:
- regular-vector[as(<integer>, character)]
- byte-character-table[character]
- <byte-character-table> has absolutely no relation to <table>. It
- is simply a <mutable-explicit-key-collection>.
-
-
-
- Regular-expressions
-
- The Regular-expressions module contains various functions that deal
- with regular expressions (regexps). The module is based on Perl
- (version 4), and has the same semantics unless otherwise noted. The
- syntax for Perl-style regular expressions can be found on page 103 of
- Programming Perl by Larry Wall and Randal L. Schwartz. There are some
- differences in the way String-extensions handles regular expressions.
- The biggest difference is that regular expressions in Dylan are case
- insensitive by default. Also, when given an invalid regexp,
- String-extensions will produce undefined behavior while Perl would
- give an error message.
-
- There is some work involved in analyzing a regular expression, and if
- the same regexp is used repeatedly with different target strings, this
- will result in wasted computation. For this reason, each basic
- function in the Regular-expressions module comes with a companion
- function that makes using a regular expression more efficient when it
- is used more than once. For example, the regexp-replace function has
- the make-regexp-replacer companion function. There is one exception;
- the join function has no make-joiner function. The "make-fooer" will
- analyze the regular expression exactly once, and provide a function
- that makes use of this pre-analyzed regular expression. For example,
- the following two pieces of code yield the same result:
- regexp-position("This is a string", "is");
-
- let is-finder = make-regexp-positioner("is");
- is-finder("This is a string");
-
- However, the second form is more efficient if is-finder is called
- multiple times.
-
- regexp-position [Generic Function]
- (big-string, regexp, #key start, end, case-sensitive)
- => variable-number-of-marks-or-#f
-
- This function returns the index of the start of the regular
- expression in the big-string, or #f if the regular expression is
- not found. As a second value, it returns the index of the end of
- the regular expression in the big-string (assuming it was found;
- otherwise there is no second value). If there are groups in the
- regular expression, regexp-position will return two more values (a
- start and an end) for each group. If the subgroup is matched,
- these will be integers. So
- regexp-position("This is a string", "is");
- returns values(2, 4), and
- regexp-position("This is a string", "(is)(.*)ing");
- returns values(2, 16, 2, 4, 4, 13), while
- regexp-position("This is a string", "(not found)(.*)ing");
- returns #f. If a group is not matched, however, both its start
- and its end will be #f. The marks are always given relative to the
- start of big-string, not relative to the start: keyword.
-
- Start: and end: specify what part of big-string to look at, and
- they default to the beginning and end of the string, respectively.
- Case-sensitive defaults to false.
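-
- Because regexp-position returns multiple values, the marks are
- normally bound with a multiple-value let. The sketch below extracts
- the text of the first match; it assumes the enclosing module uses the
- Regular-expressions module, and first-match is a hypothetical name.

```dylan
// Sketch only: assumes the Regular-expressions module is in scope.
// first-match is a hypothetical helper, not a library function.
define method first-match (text :: <string>, pattern :: <string>)
  // When nothing matches, only #f comes back, so stop is bound to #f too.
  let (start, stop) = regexp-position(text, pattern);
  if (start)
    copy-sequence(text, start: start, end: stop)
  else
    #f
  end if
end method first-match;
```

- Given the marks values(2, 4) quoted above for "This is a string" and
- "is", first-match would copy the characters from index 2 up to, but
- not including, index 4.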
-
- make-regexp-positioner [Generic Function]
- (regexp, #key byte-characters-only, need-marks,
- maximum-compile, case-sensitive)
- => an anonymous positioner function
- method (big-string, #key start, end)
- Make-regexp-positioner can return several different types of
- positioners, and it is up to the user to specify what kind of
- positioner the user wants. By default, it returns a positioner
- that works like regexp-position. However, if need-marks is #f, it
- may give a positioner that only returns #t or #f, with no marks
- (although it may still return marks). If byte-characters-only
- is specified, the positioner may only work on big-strings that
- consist only of byte characters (characters whose numerical value
- is between 0 and 255, inclusive). And if maximum-compile is #t, it
- will take longer to return a positioner, but the positioner
- will run faster.
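-
- When only a yes/no answer is needed, need-marks: #f can be passed
- and the result treated as a truth value. This is a sketch under that
- assumption; $find-number and contains-number? are hypothetical names.

```dylan
// Sketch only: assumes the Regular-expressions module is in scope.
// $find-number and contains-number? are hypothetical names.
define constant $find-number
  = make-regexp-positioner("[0-9]+", need-marks: #f);

define method contains-number? (line :: <string>) => (answer :: <boolean>)
  // The positioner may return #t/#f or marks; only the first value is
  // inspected, and in Dylan any value other than #f (even 0) is true.
  if ($find-number(line)) #t else #f end if
end method contains-number?;
```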
-
- regexp-replace [Generic Function]
- (big-string, search-for-regexp, replace-with-string,
- #key count, case-sensitive, start, end)
- => new-string
- This replaces all occurrences of search-for-regexp in big-string with
- replace-with-string. If count: is specified, it replaces only the
- first count occurrences. (This is different from Perl, which
- replaces only the first occurrence unless /g is specified.)
- Replace-with-string can contain backreferences to the regexp. For
- instance,
- regexp-replace("The rain in spain and some other text",
- "the (.*) in (\\w*\\b)", "\\2 has its \\1")
- returns "spain has its rain and some other text". If the subgroup
- referred to by the backreference was not matched, the reference is
- interpreted as the null string. For instance,
- regexp-replace("Hi there", "Hi there(, Bert)?",
- "What do you think\\1?")
- returns "What do you think?" because ", Bert" wasn't found.
-
-
- make-regexp-replacer [Generic Function]
- (regexp, #key replace-with, case-sensitive)
- => an anonymous replacer function that is either
- method (big-string, #key count, start, end)
- or
- method (big-string, replace-string, #key count, start, end)
- The first form is returned if the replace-with: keyword is
- supplied; otherwise the second form is returned. (There is no
- efficiency gained by supplying the replace-with string, but it
- might be convenient.)
-
- translate(big-string, from-string, to-string, [Generic Function]
- #key delete, start, end)
- => new-string
- This is equivalent to Perl's tr/// construct. From-string is a
- string specification of a character set, and to-string is another
- character set. Translate converts big-string character by
- character, according to the sets. For instance,
- translate("any string", "a-z", "A-Z")
- will convert "any string" to all uppercase: "ANY STRING".
-
- As in Perl, character ranges are not allowed to be "backwards". The
- following is not legal:
- translate("any string", "a-z", "z-a")
- (This restriction may be removed in future releases.) Unlike Perl's
- tr///, translate doesn't return the number of characters
- translated.
-
- If delete: is true, any characters in the from-string that don't
- have matching characters in the to-string are deleted. The
- following will remove all vowels from a string and convert periods
- to commas:
- translate("any string", ".aeiou", ",", delete: #t)
- Delete: is false by default. If delete: is false and there aren't
- enough characters in the to-string, the last character in the
- to-string is reused as many times as necessary. The following
- converts several punctuation characters into spaces:
- translate("any string", ",./:;[]{}()", " ");
- Start: and end: indicate which part of the string to translate. They
- default to the entire string.
-
- Caveats: Translate is always case sensitive.
-
- translate [G.F. Method]
- (big-byte-string, from-byte-string, to-byte-string,
- #key delete, start, end)
- => new-string
- The only method currently defined on translate operates on byte strings.
-
- make-translator [Generic Function]
- (from-string, to-string, #key delete)
- => an anonymous translator
- method (big-string, #key start, end) => new-string
- Returns a translator that behaves like translate called with the
- same from-string, to-string, and delete: arguments.
-
- make-translator [G.F. Method]
- (from-byte-string, to-byte-string, #key delete)
- => an anonymous translator
- method (big-string, #key start, end) => new-byte-string
- Again, the existing method on make-translator only handles byte
- strings.
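-
- A brief sketch of reusing one translator across many strings. It
- assumes the enclosing module uses the Regular-expressions module;
- $upcase-ascii and shout are hypothetical names.

```dylan
// Sketch only: assumes the Regular-expressions module is in scope.
// $upcase-ascii and shout are hypothetical names.
define constant $upcase-ascii = make-translator("a-z", "A-Z");

define method shout (text :: <byte-string>) => (result :: <string>)
  // Same character mapping as translate(text, "a-z", "A-Z"), with the
  // two sets analyzed once when the constant is initialized.
  $upcase-ascii(text)
end method shout;
```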
-
- split [Generic Function]
- (regexp, big-string, #key count, remove-empty-items,
- case-sensitive, start, end)
- => a variable number of strings
- This is like Perl's split function. It searches big-string for
- occurrences of regexp, and returns substrings that were delimited by
- that regexp. For instance,
- split("-", "long-dylan-identifier")
- returns values("long", "dylan", "identifier"). Note that what
- matched the regexp is left out. Remove-empty-items, which defaults
- to true, magically skips over empty items, so that
- split("-", "long--with--multiple-dashes")
- returns values("long", "with", "multiple", "dashes"). Count is the
- maximum number of strings to return. If there are n strings and
- count is specified, the first count - 1 strings are returned as
- usual, and the count'th string is the remainder, unsplit. So
- split("-", "really-long-dylan-identifier", count: 3)
- returns values("really", "long", "dylan-identifier"). If
- remove-empty-items is true, empty items aren't counted.
-
- Case-sensitive: determines whether the regexp for the delimiter is
- treated as case sensitive; it defaults to false (case
- insensitive). Start: and end: indicate what part of the big
- string should be looked at for delimiters. They default to the
- entire string. For instance,
- split("-", "really-long-dylan-identifier", start: 8)
- returns values("really-long", "dylan", "identifier"). Caveat:
- Unlike Perl, empty regular expressions are never legal regular
- expressions, so there is no way to split a string into a #rest
- sequence-of-characters. Of course, in Dylan this is not a useful
- thing to do, so this is not really a problem.
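-
- Since split returns a variable number of values, a caller that wants
- them as one collection can gather them with #rest in a let binding.
- This is a sketch assuming the Regular-expressions module is in use;
- dash-parts is a hypothetical name.

```dylan
// Sketch only: assumes the Regular-expressions module is in scope.
// dash-parts is a hypothetical helper, not a library function.
define method dash-parts (identifier :: <string>) => (parts :: <sequence>)
  // #rest in a let binding gathers every returned value into a sequence.
  let (#rest parts) = split("-", identifier);
  parts
end method dash-parts;
```

- With the example above, dash-parts("long-dylan-identifier") would
- hold the three strings "long", "dylan", and "identifier".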
-
- make-splitter [Generic Function]
- (pattern :: <string>, #key case-sensitive)
- => an anonymous splitter
- method (big-string, #key count, remove-empty-items,
- start, end) => buncha-strings
- Returns a splitter that behaves like split called with the same
- pattern and case-sensitive: setting.
-
- join [Generic Function]
- (delimiter :: <string>, #rest strings) => big-string
- Does the opposite of a split.
- join(":", word1, word2, word3)
- is equivalent to
- concatenate(word1, ":", word2, ":", word3)
- (and no more efficient). Note that there is no make-joiner.
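-
- When the strings to join are already in a sequence, apply can spread
- them as the #rest arguments of join. The sketch below re-delimits a
- dashed identifier with colons; it assumes the Regular-expressions
- module is in use, and dashes-to-colons is a hypothetical name.

```dylan
// Sketch only: assumes the Regular-expressions module is in scope.
// dashes-to-colons is a hypothetical helper, not a library function.
define method dashes-to-colons (identifier :: <string>)
 => (result :: <string>)
  let (#rest parts) = split("-", identifier);
  // apply passes the elements of parts as join's #rest strings.
  apply(join, ":", parts)
end method dashes-to-colons;
```

- Because remove-empty-items defaults to true, runs of consecutive
- dashes collapse into a single colon in this sketch.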
-
-
-
- Substring-search
-
- The Substring-search module contains methods for searching for fixed
- substrings rather than general regular expressions. It is as similar to the
- regular expression module as we could make it. Substring functions
- work only on byte strings, and are always case sensitive. These
- functions were taken from the Collection-extensions library shipped in
- Mindy 1.1, but the parameters, keywords, and return values have
- changed significantly since then.
-
- substring-position [Generic Function]
- (big-string, search-for-string, #key start, end)
- => position-or-false
- Returns the position of the search-for-string in the big-string (or
- that portion of the big-string specified by start: and end:). This
- search is always case sensitive.
-
- This function uses the Boyer-Moore algorithm for long strings, and
- a simple naive search for short strings. It should yield good
- performance under all circumstances.
-
- make-substring-positioner (search-for-string) [Generic Function]
- => an anonymous positioner
- method (big-string, #key start, end) => position-or-false
- Returns a positioner that searches big-string for search-for-string,
- as substring-position does.
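-
- A short sketch of a reusable fixed-string search. It assumes the
- enclosing module uses the Substring-search module; $find-todo and
- mentions-todo? are hypothetical names.

```dylan
// Sketch only: assumes the Substring-search module is in scope.
// $find-todo and mentions-todo? are hypothetical names.
define constant $find-todo = make-substring-positioner("TODO");

define method mentions-todo? (line :: <byte-string>)
 => (answer :: <boolean>)
  // A match at position 0 still counts as found: only #f is false.
  if ($find-todo(line)) #t else #f end if
end method mentions-todo?;
```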
-
- substring-replace [Generic Function]
- (big-string, search-for-string, replace-with-string,
- #key count, start, end)
- => replaced-string
- Replaces the substring, or the first count instances of it if
- count: is specified. Note that, despite the signature above, this
- function does not support start: or end:.
-
- make-substring-replacer [Generic Function]
- (search-for :: <byte-string>, #key replace-with)
- => an anonymous replacer function that is either
- method (big-string, #key count, start, end) => new-string
- or
- method (big-string, replace-with-string, #key count, start, end)
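- Returns a replacer function that performs the same replacement as
- substring-replace, using the pre-analyzed search-for string.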
-
-
-
- Known bugs
-
- Regular-expressions will do unpredictable things if given bad
- arguments (i.e., a string that isn't a legal regular expression).
- Sometimes it'll crash, and sometimes it'll merrily chug away and
- return crazy answers.
-
- The regexp parser will happily accept a "quantified assertion," which
- isn't technically a legal regexp. However, both regular and compiled
- matching will handle it as one intuitively thinks it should be
- handled. (An example of a quantified assertion would be "^*", which
- matches "any number of beginning of line". Since "*" means "0 or
- more", "^*" is interpreted to mean "", which is how one would
- intuitively believe it should be interpreted.)
-